For S3D and 4D meta-analysis see (K. Kim, et al. 2017).

Cluster separation and task evaluation

In general, defining feature importance is a common task, though the definition is typically model dependent.

Measuring which variables are sensitive to cluster separation

We take inspiration from the scree plot and try to apply it to the LDA-like approach. Consider a scree plot of the flea data. This shows which components contribute to the full-sample, full-dimensionality, \([n,p]\) variation of the data.
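The scree plot's component contributions come directly from the eigenvalues of the sample covariance matrix. A minimal sketch (the data here are a synthetic stand-in, not the flea measurements):

```python
import numpy as np

def scree_contributions(X):
    """Proportion of total variance explained by each principal component."""
    # Eigenvalues of the sample covariance matrix, largest first.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Synthetic stand-in with 6 variables of decreasing scale (not the flea data).
rng = np.random.default_rng(0)
X = rng.normal(size=(74, 6)) @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])
props = scree_contributions(X)
print(np.round(props, 3))   # one proportion per component, summing to 1
```

Plotting `props` against component index gives the scree plot; the analogue below swaps the variance eigenvalues for per-variable cluster-separation contributions.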

The user study task tries to explore the full-sample, full-dimensionality \([n,p]\) separation of two specified clusters. In an analogous manner, we create a scree-plot-like output to evaluate the contributions of the variables to the cluster separation described in the data. This is related to what R. Fisher attempted in his 1936 paper on discriminant analysis. Similarly, we start by finding cluster means and covariances.

Cluster means:

          tars1 tars2 head  aede1 aede2 aede3
Concinna  183.1 129.6 51.24 146.2 14.10 104.9
Heptapot. 138.2 125.1 51.59 138.3 10.09 106.6
Heikert.  201.0 119.3 48.87 124.7 14.29 81.0

Cluster variance-covariance matrices:

Concinna:

tars1 tars2 head aede1 aede2 aede3
tars1 147.49 66.638 18.5262 15.081 -5.2095 14.214
tars2 66.64 51.248 11.5452 2.476 -1.8119 3.093
head 18.53 11.545 4.9905 5.852 -0.5238 5.486
aede1 15.08 2.476 5.8524 31.662 -0.9690 15.629
aede2 -5.21 -1.812 -0.5238 -0.969 0.7905 -1.986
aede3 14.21 3.093 5.4857 15.629 -1.9857 38.229

Heptapot.:

tars1 tars2 head aede1 aede2 aede3
tars1 87.3268 44.5498 20.5260 19.1732 -0.7359 15.2879
tars2 44.5498 73.0390 15.7056 14.0216 -0.3896 21.2294
head 20.5260 15.7056 8.0628 8.2121 -0.2944 4.9675
aede1 19.1732 14.0216 8.2121 17.1602 -0.5022 7.9264
aede2 -0.7359 -0.3896 -0.2944 -0.5022 0.9437 0.2771
aede3 15.2879 21.2294 4.9675 7.9264 0.2771 34.2532

Heikert.:

tars1 tars2 head aede1 aede2 aede3
tars1 222.133 63.4000 22.6000 30.3667 4.3667 29.467
tars2 63.400 44.1591 7.9097 11.8183 0.3366 11.467
head 22.600 7.9097 5.5161 5.6860 0.0054 4.233
aede1 30.367 11.8183 5.6860 21.3699 -0.3269 11.700
aede2 4.367 0.3366 0.0054 -0.3269 1.2129 1.267
aede3 29.467 11.4667 4.2333 11.7000 1.2667 79.733
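Summaries like these can be computed with a groupby. A sketch on synthetic stand-in data (the cluster centers and values are illustrative, not the flea measurements):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flea data: 3 clusters, 6 variables (values illustrative).
rng = np.random.default_rng(1)
cols = ["tars1", "tars2", "head", "aede1", "aede2", "aede3"]
frames = []
for species, center in [("Concinna", 180.0), ("Heptapot.", 140.0), ("Heikert.", 200.0)]:
    block = pd.DataFrame(rng.normal(center, 10.0, size=(20, 6)), columns=cols)
    block["species"] = species
    frames.append(block)
flea = pd.concat(frames, ignore_index=True)

means = flea.groupby("species").mean()                          # one row per cluster
covs = {s: g[cols].cov() for s, g in flea.groupby("species")}   # per-cluster covariance
print(means.round(1))
```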

Suppose the clusters in question are Concinna and Heptapot. The line between the cluster means of these groups is their difference. This is sufficient for Linear Discriminant Analysis, which assumes homogeneous variation between clusters. Instead we start from Fisher's discriminant analysis, which accounts for within-cluster variance.

\[ \text{ClusterSeparation}_{[1,p]} = (\mu_{b[1,p]} - \mu_{a[1,p]})^2 ~/~ (\Sigma_{a[p,p]} + \Sigma_{b[p,p]}); \quad a, b \text{ are clusters} \in X_{[n,p]} \]

Then we alter the sum of the within-cluster covariances to its pooled equivalent (the sample-size-weighted average of the two):

\[ \text{ClusterSeparation}_{[1,p]} = (\mu_{b[1,p]} - \mu_{a[1,p]})^2 ~/~ \left[ (\Sigma_{a[p,p]} \cdot n_a + \Sigma_{b[p,p]} \cdot n_b) ~/~ (n_a + n_b) \right]; \quad a, b \text{ are clusters} \in X_{[n,p]} \]

var var_clSep cumsum_clSep
aede1 0.28 0.28
aede2 0.25 0.53
tars2 0.19 0.73
tars1 0.13 0.86
aede3 0.07 0.93
head 0.07 1.00

We discard the sign, as we only care about the magnitude each variable contributes to the separation of the specified clusters. We scale the absolute terms by the inverse of their summation. Now let's visualize this, similar to the scree plot.
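Under the pooled form above, the per-variable contributions can be sketched as follows. Here the division is taken element-wise on the covariance diagonals (our assumption), and the means, covariances, and cluster sizes are illustrative rather than the flea values:

```python
import numpy as np

def cluster_separation(mu_a, mu_b, cov_a, cov_b, n_a, n_b):
    """Per-variable contribution to the separation of clusters a and b.

    Squared mean difference over the pooled within-cluster variance,
    taken element-wise on the covariance diagonals, then scaled to sum to 1.
    """
    pooled = (np.diag(cov_a) * n_a + np.diag(cov_b) * n_b) / (n_a + n_b)
    raw = (np.asarray(mu_b) - np.asarray(mu_a)) ** 2 / pooled
    # Discard sign; scale by the inverse of the summation.
    return np.abs(raw) / np.abs(raw).sum()

# Illustrative two-cluster example with 3 variables.
mu_a, mu_b = [0.0, 0.0, 0.0], [2.0, 1.0, 0.0]
cov = np.eye(3)
sep = cluster_separation(mu_a, mu_b, cov, cov, 30, 30)
print(np.round(sep, 2))   # largest share goes to the biggest standardized mean gap
```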

Evaluating the response

Now that we have a measure, we want to define an objective cutoff for evaluation. We want the measure to have a few attributes:

  • Continuous relative to the cluster seperation
  • Sum of squares should equal 1
  • Symmetric, diverging around uniform weight

Following these, we define the measure to be: \[ diff_i = \text{ClusterSeparation}_i - 1/(p - 1) \] \[ \text{marks} = \sum_{i=1}^{p} I(\text{response}_i) \cdot \text{sgn}(diff_i) \cdot \sqrt{|diff_i|} \]

Here, we add lines indicating the weight of each variable if selected as important. We then apply our measure to evaluate task responses; an example response is reviewed below:

variable var_clSep diff weight exampleResponse marks
aede1 0.28 0.08 0.28 1 0.28
aede2 0.25 0.05 0.23 1 0.23
tars2 0.19 -0.01 -0.08 0 0.00
tars1 0.13 -0.07 -0.27 1 -0.27
aede3 0.07 -0.13 -0.35 0 0.00
head 0.07 -0.13 -0.36 1 -0.36

Total marks = -0.12
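The marking scheme can be sketched directly. Using the rounded `var_clSep` values from the table (p = 6 variables) and the example response, the total agrees with the -0.12 above; individual weights can differ slightly from the table because the printed separation values are rounded:

```python
import numpy as np

def marks(cl_sep, response):
    """Score a response: signed square-root distance from uniform weight 1/(p-1)."""
    cl_sep = np.asarray(cl_sep, dtype=float)
    p = len(cl_sep)
    diff = cl_sep - 1.0 / (p - 1)                    # deviation from uniform weight
    weight = np.sign(diff) * np.sqrt(np.abs(diff))   # symmetric, diverging around 0
    return np.sum(np.asarray(response) * weight)     # only selected variables count

# Table order: aede1, aede2, tars2, tars1, aede3, head.
cl_sep = [0.28, 0.25, 0.19, 0.13, 0.07, 0.07]
response = [1, 1, 0, 1, 0, 1]
print(round(marks(cl_sep, response), 2))   # -> -0.12
```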

Projected data view

All linear projections are necessarily lossy representations of the full data. By this we mean that no single 2D frame can show the whole set of information for a \(p \geq 3\)-dimensional sample. Any pair of principal components necessarily shows less than all of the variation, namely the sum of their contributions, typically stated as a percentage of the full-sample variation. Analogously, no single projection can show the full information explaining the cluster separation of two given clusters.

In application, a PC1 by PC2 biplot of the flea data contains 94.72 percent of the variation explained in the sample, while viewing (an orthogonal projection of) the top 2 variables (namely aede1 and aede2) explains 53.26 percent of the within-sample cluster separation between Concinna and Heptapot.

Cluster separation on single-variable permuted data

Application to other toy sets

In order to stress test this cluster separation viewed as a scree plot, we apply it to other toy datasets.

Penguins, between levels of species

Penguins, between levels of sex (invalid)

(invalid assumptions, as there are 3 species clusters for each sex)

Penguins, between levels of sex with 1 species

Wine, between levels of type of wine

Breastcancer, between benign/malignant tumors

Olive, between levels of region of Italy

Rat CNS gene expression, between levels of “the high-level classes”

Testing our expectations

Can we simulate the cluster separation that we expect? Let's create a simulation that has variable contributions for the following cases:

Observe how changing the variance-covariances changes cluster separation, given that cluster means differ as 80, 20, rep(0) (signal from the means is large relative to the variance):

  1. 2 variables
  2. 5 variables
  3. 5 variables, within each cluster V1-V2 covariance set to .3
  4. 5 variables, Cluster 1 covariance: all off-diagonals set to .7, diagonals set to 5. Cluster 1 covariance: diag(5)
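Cases like these can be generated with multivariate normal draws. A sketch of case 3 (5 variables, within-cluster V1-V2 covariance .3); the cluster sizes are assumptions, and the separation measure is the element-wise pooled form from earlier:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
# Cluster means differ as 80, 20, then zeros (signal concentrated in V1, V2).
mu_a = np.zeros(p)
mu_b = np.array([80.0, 20.0, 0.0, 0.0, 0.0])

# Case 3: unit variances with V1-V2 covariance set to .3 within each cluster.
cov = np.eye(p)
cov[0, 1] = cov[1, 0] = 0.3

a = rng.multivariate_normal(mu_a, cov, size=140)
b = rng.multivariate_normal(mu_b, cov, size=140)

# Element-wise separation on the pooled variance diagonal, scaled to sum to 1.
pooled = (np.var(a, axis=0) * len(a) + np.var(b, axis=0) * len(b)) / (len(a) + len(b))
raw = (b.mean(axis=0) - a.mean(axis=0)) ** 2 / pooled
sep = raw / raw.sum()
print(np.round(sep, 3))   # V1 dominates, V2 small, V3-V5 near zero
```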

Simulation difficulty

In order to properly distinguish a difference between the 3 visualization factors, the data must be of suitable complexity, such that it has the following properties:

  1. Must be complex enough that the separation cannot be seen within any pair of the first 4 Principal Components, such that PCA is not sufficient for exploring cluster separation
  2. Must not be so complex as to preclude any meaningful response given the factor visuals and time constraints.
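Property 1 can be checked numerically: project the simulated data onto the first four PCs and measure how much two-cluster separation survives on any of those axes. A sketch under the same standardized-gap idea as the separation measure above (construction and thresholds are illustrative):

```python
import numpy as np

def pc_pair_separation(X, labels, k=4):
    """Largest two-cluster mean gap (in pooled-SD units) among the first k PCs."""
    Xc = X - X.mean(axis=0)
    # Principal components via SVD of the centered data, descending variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ vt[:k].T
    a, b = scores[labels == 0], scores[labels == 1]
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(
        (a.var(axis=0) * len(a) + b.var(axis=0) * len(b)) / (len(a) + len(b)))
    # A PC pair shows the clusters only if at least one of its axes has a large
    # gap, so the best pair is governed by the largest single-axis gap.
    return gap.max()

rng = np.random.default_rng(3)
# Separation hidden in a low-variance direction: large nuisance variance on
# V1-V4, mean shift only on V5.
cov = np.diag([100.0, 80.0, 60.0, 40.0, 1.0])
a = rng.multivariate_normal(np.zeros(5), cov, size=100)
b = rng.multivariate_normal([0, 0, 0, 0, 3.0], cov, size=100)
X = np.vstack([a, b])
labels = np.repeat([0, 1], 100)
print(round(pc_pair_separation(X, labels, k=4), 2))  # small: first 4 PCs miss it
```

Such data satisfies property 1: the split only appears once the last, low-variance component is included.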

Let's evaluate our current generation of data simulations against these properties.

User study simulation 301

This was a 300 series simulation done at the end of the generation 1 user study shiny app.

PCA

This seems complex enough that the separation cannot be seen in any pair of the first 4 Principal Components. Now let's see whether anything is visible in radial tours of all variables. We view the cluster separation to explore which variables should contain contributions.

Cluster Separation

Radial tour gifs for each manip var

References

Fisher, Ronald A. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7, no. 2 (September 1936): 179-88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.